Search results for "Content extraction"

showing 4 items of 4 documents

Readability and the Web

2012

Readability indices measure how easy or difficult it is to read and comprehend a text. In this paper we look at the relation between readability indices and web documents from two different perspectives. On the one hand, we analyse how to reliably measure the readability of web documents by applying content extraction techniques and incorporating a bias correction. On the other hand, we investigate how web-based corpus statistics can be used to measure readability in a novel and language-independent way.
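
As an illustration of what a readability index computes, the following minimal sketch implements the Flesch Reading Ease score. The abstract does not say which indices the paper uses, so the choice of Flesch is an assumption, and the vowel-group syllable counter is a rough heuristic, not a linguistic tool.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat on the mat."
hard = ("Comprehensive readability estimation necessitates "
        "sophisticated computational methodologies.")
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

Running such an index on a whole HTML page would also score menus and footers, which is why the abstract pairs the measurement with content extraction.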

060201 languages & linguistics; Measure (data warehouse); Information retrieval; content extraction; lcsh:T58.5-58.64; Relation (database); lcsh:Information technology; Computer Networks and Communications; Computer science; business.industry; web document readability; corpus statistics; 06 humanities and the arts; 02 engineering and technology; Readability; World Wide Web; 0602 languages and literature; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 0202 electrical engineering electronic engineering information engineering; Web application; 020201 artificial intelligence & image processing; Bias correction; business
Future Internet
researchProduct

Content Code Blurring: A New Approach to Content Extraction

2008

Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions we show that for most documents content code blurrin…
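
The idea of finding long, homogeneous text regions iteratively can be illustrated with a loose sketch. This is not the paper's algorithm, whose details are cut off in the abstract above. The per-character content/code vector, the box-filter smoothing, and the values of `radius`, `iterations` and `threshold` are all assumptions chosen for illustration.

```python
import re


def content_code_vector(html):
    """Mark every character as content (1.0) or tag code (0.0)."""
    chars, flags = [], []
    for token in re.findall(r"<[^>]*>|[^<]+", html):
        is_content = not token.startswith("<")
        for ch in token:
            chars.append(ch)
            flags.append(1.0 if is_content else 0.0)
    return chars, flags


def blur(values, radius):
    """One smoothing pass: replace each value by its neighbourhood mean."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def extract_main_text(html, radius=10, iterations=3, threshold=0.75):
    """Keep content characters whose blurred neighbourhood is mostly content."""
    chars, flags = content_code_vector(html)
    blurred = flags
    for _ in range(iterations):  # assumed fixed number of passes
        blurred = blur(blurred, radius)
    kept = [ch for ch, flag, score in zip(chars, flags, blurred)
            if flag == 1.0 and score >= threshold]
    return "".join(kept).strip()


page = ('<div><a href="/a">Home</a><a href="/b">News</a></div><p>'
        + "Main article text. " * 20
        + '</p><div><a href="/c">Imprint</a></div>')
print(extract_main_text(page))
```

Short link texts sit between many tag characters, so their smoothed score stays low, while the long paragraph keeps a score near 1. A few characters at the paragraph edges are lost in this simplified version.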

Information retrieval; Computer science; business.industry; Content (measure theory); Content extraction; Process (computing); Code (cryptography); business; Knowledge acquisition; Content management
2008 19th International Conference on Database and Expert Systems Applications
researchProduct

Combining content extraction heuristics

2008

The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web …
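
One simple way to combine several heuristics is a vote over text blocks, sketched below. This does not describe how CombinE or Crunch work internally. The three heuristics, their cut-off values and the `min_votes` parameter are hypothetical and only show why combinations need to be evaluated and tuned.

```python
from typing import Callable, List, Sequence

# A heuristic maps a list of text blocks to one keep/drop decision per block.
Heuristic = Callable[[Sequence[str]], List[bool]]


def long_block(blocks: Sequence[str]) -> List[bool]:
    """Hypothetical heuristic: main content blocks have at least 8 words."""
    return [len(b.split()) >= 8 for b in blocks]


def has_sentence_punctuation(blocks: Sequence[str]) -> List[bool]:
    """Hypothetical heuristic: main content contains sentence punctuation."""
    return [any(p in b for p in ".!?") for b in blocks]


def mostly_lowercase(blocks: Sequence[str]) -> List[bool]:
    """Hypothetical heuristic: menus tend to capitalise most words."""
    result = []
    for b in blocks:
        words = b.split()
        capitalised = sum(1 for w in words if w[0].isupper())
        result.append(bool(words) and capitalised / len(words) < 0.5)
    return result


def combine_by_vote(blocks: Sequence[str],
                    heuristics: Sequence[Heuristic],
                    min_votes: int) -> List[str]:
    """Keep every block that at least min_votes heuristics want to keep."""
    decisions = [h(blocks) for h in heuristics]
    votes = [sum(d[i] for d in decisions) for i in range(len(blocks))]
    return [b for b, v in zip(blocks, votes) if v >= min_votes]


blocks = [
    "Home | News | Contact",
    "Content extraction identifies the main text of a web page and removes the rest.",
    "Copyright 2008. All rights reserved.",
]
heuristics = [long_block, has_sentence_punctuation, mostly_lowercase]
print(combine_by_vote(blocks, heuristics, min_votes=2))
print(combine_by_vote(blocks, heuristics, min_votes=3))
```

With `min_votes=2` the copyright line survives, and with `min_votes=3` only the article sentence remains. Choosing such parameters is the kind of decision a test and evaluation framework is meant to support.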

Information retrieval; Computer science; media_common.quotation_subject; Design elements and principles; computer.software_genre; Crunch; Task (project management); Content extraction; Quality (business); Data mining; Heuristics; Web document; computer; media_common
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
researchProduct

Estimating web site readability using content extraction

2009

Nowadays, information is primarily searched for on the WWW. From a user perspective, readability is an important criterion for measuring the accessibility and thereby the quality of a piece of information. We show that modern content extraction algorithms help to estimate the readability of a web document quite accurately.

Information retrieval; business.industry; Computer science; media_common.quotation_subject; Content extraction; Quality (business); Usability; business; Readability; media_common; Web site
Proceedings of the 18th international conference on World wide web
researchProduct